Covariance and correlation


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

Covariance

Measures how two variables vary in tandem around their means. To compute it, take each variable's vector of observations and subtract the variable's mean, giving a vector of deviations from the mean. The covariance is then the dot product of the two deviation vectors divided by the number of observations n (or by n-1 for the sample covariance, as in the code below).

$\operatorname{cov} (X,Y)=\frac{1}{n}\sum_{i=1}^n (x_i-E(X))(y_i-E(Y)).$


In [2]:
# generate two random variables
x = np.random.normal(3.0, 1.0, 1000)
y = np.random.normal(50.0, 10.0, 1000)

# deviations from the mean (vectorized instead of list comprehensions)
x_var = x - x.mean()
y_var = y - y.mean()

# the covariance is the dot product of the deviation vectors
# divided by n-1, because we're going for the sample covariance
# (we would divide by n to compute the covariance of the population)
n = len(x)
print(np.dot(x_var, y_var) / (n - 1))  # compare with np.cov(x, y)

# plot both variables
plt.scatter(x,y, color='gray');plt.xlabel('X'); plt.ylabel('Y')


-0.134978391004
Out[2]:
<matplotlib.text.Text at 0x7f15d94c2210>

Covariance is hard to interpret on its own: a covariance close to zero suggests little relationship between the variables, but its magnitude depends on the units of each variable, so it does not tell you how strong a relationship is. This is where correlation comes in.
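To see the unit problem concretely, here is a small sketch (the metre/centimetre example and variable names are mine, not from the notebook): rescaling one variable rescales the covariance by the same factor, while the correlation is unchanged.

```python
import numpy as np

rng = np.random.RandomState(0)
h_m = rng.normal(1.7, 0.1, 1000)                 # heights in metres
w = 50.0 + 30.0 * h_m + rng.normal(0, 5, 1000)   # weights, linearly related

h_cm = h_m * 100.0                               # the same heights in centimetres

# covariance scales with the units...
print(np.cov(h_m, w)[0, 1], np.cov(h_cm, w)[0, 1])        # second is 100x the first
# ...but correlation does not
print(np.corrcoef(h_m, w)[0, 1], np.corrcoef(h_cm, w)[0, 1])
```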

Correlation

Correlation divides the covariance by the product of the standard deviations of both variables, which normalizes it. This means that a correlation of 1 is perfect positive correlation, a correlation of -1 is perfect inverse correlation, and zero means no correlation at all.

Remember that correlation does not imply causation!


In [ ]:
y = np.random.normal(50.0, 10.0, 1000)/x
plt.scatter(x,y, color='gray'); plt.xlabel('X'); plt.ylabel('Y');

Covariance is sensitive to the units used in the variables, which makes it difficult to interpret. Correlation normalizes everything by their standard deviations, giving you an easier to understand value that ranges from -1 (for a perfect inverse correlation) to 1 (for a perfect positive correlation):


In [ ]:
# to calculate the correlation we first compute the covariance
x_var = x - x.mean()
y_var = y - y.mean()

covar = np.dot(x_var, y_var) / (len(x) - 1)

# divide by the sample standard deviations (ddof=1 to match the n-1 above)
covar / (x.std(ddof=1) * y.std(ddof=1))
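One detail worth checking (my own sanity check, not part of the original notebook): the manual value only matches `np.corrcoef` when the n-1 in the covariance and the `ddof` of the standard deviations agree, because the normalization factors then cancel.

```python
import numpy as np

rng = np.random.RandomState(1)
x = rng.normal(3.0, 1.0, 1000)
y = rng.normal(50.0, 10.0, 1000) / x

covar = np.dot(x - x.mean(), y - y.mean()) / (len(x) - 1)
r_manual = covar / (x.std(ddof=1) * y.std(ddof=1))  # sample std to match n-1
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)  # the two agree to floating-point precision
```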

In [ ]:
# NumPy computes the full 2x2 correlation matrix directly
np.corrcoef(x,y)

In [ ]:
# the matrix is symmetric: swapping the arguments gives the same result
np.corrcoef(y,x)
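As a quick endpoint check (an example of mine, not from the notebook): any exact linear relationship hits the ±1 bounds, regardless of slope or offset.

```python
import numpy as np

a = np.arange(10, dtype=float)
print(np.corrcoef(a, 2.0 * a + 3.0)[0, 1])  # ~1.0: perfect positive correlation
print(np.corrcoef(a, -0.5 * a)[0, 1])       # ~-1.0: perfect inverse correlation
```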
